Statistical Graphics for High-Dimensional Data

Susan VanderPlas

Statistics Department


Iowa State University

Does this data have homogeneous variance?

Outline


  • Statistical Graphics
  • Big Data Challenges and Graphics
  • Case Study: Designing Interactive Graphics for Soybean Population Genetics
  • Next Steps

Statistical Graphics

Good Statistical Graphics

Function:

  • Show the data
  • Don’t distort the data

Form:

  • Show a consistent story
  • Provide several levels of detail (ideally)

Elegance:
How do I best communicate the data?

  • Perceptual Awareness
  • Visual Bandwidth (information overload)

Statistical Graphics Literature

History of Graph Criticism:

Examples of Good Graphics:

Guidelines for Creating Good Graphs:

Statistical Graphics Literature

Ranking of Simple Graphical Tasks

Visual Inference

Dissertation

Do any of these plots have homogeneous variance?

4

Which Plot has Homogeneous Variance?

Sine Illusion

Sine Illusion - Explained

Perception is optimized for three dimensions.
Our brains sometimes inappropriately apply 3d heuristics to 2d images, producing optical illusions.

Sine Illusion

  • Case Study: DW, an individual who lacks binocular depth perception, is immune to the illusion

  • Subconscious (can’t be “un-seen”)

  • Affects perception of variability or height
    • Candlestick plots (finance)
    • Time Series
    • Scatter plots (nonlinear trend)
    • Streamgraphs
    • Stacked area plots


Solution 1: Transform X


From: Signs of the Sine Illusion: Why we need to care (JCGS, 2015)

\((f \circ T)(x) = a + (b-a)\left(\int_{a}^{x} |f^\prime(z)|\, dz\right)/\left(\int_{a}^{b}|f^\prime(z)|\, dz\right)\)

\((f \circ T_w)(x) = (1-w) \cdot x + w \cdot (f \circ T)(x)\)
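
The x-axis correction above can be sketched numerically. This is a minimal illustration, not the code used in the paper: the function name `transform_x`, the trapezoid-rule integration, and the example trend f(x) = sin(x) are all assumptions made for this sketch, and it assumes f is not constant on [a, b].

```python
import numpy as np

def transform_x(x, f_prime, w=1.0):
    """Weighted x-axis correction on [a, b] = [x[0], x[-1]].

    x is a sorted grid of plotting positions and f_prime is the derivative
    of the trend function f. Assumes |f'| is not zero everywhere.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[0], x[-1]
    abs_slope = np.abs(f_prime(x))
    # Cumulative trapezoid rule approximates the integral of |f'| from a to x.
    steps = 0.5 * (abs_slope[1:] + abs_slope[:-1]) * np.diff(x)
    cumulative = np.concatenate([[0.0], np.cumsum(steps)])
    full = a + (b - a) * cumulative / cumulative[-1]
    # w = 0 returns x unchanged; w = 1 applies the full correction.
    return (1.0 - w) * x + w * full

# Hypothetical example: f(x) = sin(x), so f'(x) = cos(x).
x = np.linspace(0.0, 2.0 * np.pi, 201)
x_full = transform_x(x, np.cos, w=1.0)
x_partial = transform_x(x, np.cos, w=0.5)
```

The corrected positions keep the endpoints a and b fixed and stretch the axis where the trend is steep, which is where the illusion is strongest.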

Solution 2: Transform Y


From: Signs of the Sine Illusion: Why we need to care (JCGS, 2015)

\(l_{new}(x_0) = l_{old} \sqrt{1 + f^\prime(x_0)^2}\)

\(l_{new_w}(x) = (1-w) \cdot l_{old} + w \cdot l_{new}(x)\)
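
A minimal sketch of the y-axis correction above, under stated assumptions: the function name `corrected_length` is hypothetical, and the slope is assumed to be measured in plotting units with an aspect ratio of 1.

```python
import numpy as np

def corrected_length(l_old, slope, w=1.0):
    """Weighted line-length correction for a trend with the given slope."""
    slope = np.asarray(slope, dtype=float)
    # Full correction: lengthen the segment by sqrt(1 + f'(x)^2).
    l_new = l_old * np.sqrt(1.0 + slope ** 2)
    # Blend the original and fully corrected lengths with weight w.
    return (1.0 - w) * l_old + w * l_new

# Hypothetical example: unit-length segments along f(x) = sin(x).
x = np.linspace(0.0, 2.0 * np.pi, 9)
lengths = corrected_length(1.0, np.cos(x), w=1.0)
```

Segments stay at their original length where the trend is flat and grow by a factor of up to sqrt(2) where the slope reaches 1 in absolute value.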

Experimental Validation


  • Corrections validated experimentally
  • Data collected using Amazon Mechanical Turk
  • 206 participants completed 1374 trials in 4 days
  • Goals:
    • Identify range of acceptable weight values
    • Examine whether weight values are specific to individuals or consistent within the population

Results

  • Histograms show estimated parameter values
  • Lines show individual estimated values

  • w=0 (uncorrected) and w=1 (fully corrected) are not acceptable weights.
  • Weight values are similar across correction type and individual

Findings

  • Sine Illusion affects our perception of heteroskedasticity in statistical plots

  • Corrections are effective at removing the illusion’s effects

  • Partial correction is still effective

  • “Don’t distort the data”: We need to be concerned with psychological distortion

Big Data: Challenges and Graphics

Big Data

Visualization is an important tool for working with big data

Adaptations must be made:

  • Overplotting (large \(n\))
  • High-dimensional data (large \(p\))
  • Distributed/multi-source data, hierarchical data
  • No solution (binning, dimension reduction, tours) works for every situation
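
Binning, one of the remedies named above for overplotting, can be sketched as follows. The simulated data and the choice of 50 bins per axis are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: a large sample that would overplot in a scatterplot.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Replace n points with a 50 x 50 grid of counts; the plot then draws
# at most 2,500 tiles (for example as a heatmap) regardless of n.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=50)
```

The number of graphical elements now depends on the grid size rather than on n, at the cost of hiding structure finer than one bin.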

Interactive Graphics

  • Provide additional information in response to user action

  • Simultaneously show more than 2-3 variables and their relationship (multiple linked plots)

  • Accommodate complex data structures

BUT…


Web-based interactive graphics may be even more size-sensitive than static graphics.

Interactive Visualization

  • Lacks the rigor of a grammar of interactivity

  • Design is a function of necessity (for now!), which can lead to sub-optimal graphics
    • Interactivity vs. Animation vs. Static Plots
    • Many types of interactivity, with different use cases:
      Brushing, linked plots, subsetting, zoom-and-filter

  • Perceptual research is limited
    • Extremely specific use cases
    • Low-level psychological effects
    • Testing paradigms are somewhat difficult
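
Brushing with linked plots, one of the interaction types listed above, reduces to sharing a selection of row indices across views. A minimal sketch with hypothetical toy data and a hypothetical `brush` helper:

```python
import numpy as np

def brush(x, y, x_range, y_range):
    """Return a boolean mask of the points inside a rectangular brush."""
    x = np.asarray(x)
    y = np.asarray(y)
    return ((x >= x_range[0]) & (x <= x_range[1]) &
            (y >= y_range[0]) & (y <= y_range[1]))

# Hypothetical data: five observations, four variables.
data = {
    "a": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    "b": np.array([2.0, 1.0, 4.0, 3.0, 5.0]),
    "c": np.array([10.0, 20.0, 30.0, 40.0, 50.0]),
    "d": np.array([5.0, 4.0, 3.0, 2.0, 1.0]),
}

# The user drags a rectangle in the (a, b) plot ...
selected = brush(data["a"], data["b"], x_range=(1.5, 4.5), y_range=(0.5, 3.5))

# ... and the same rows are highlighted in the linked (c, d) plot.
linked_c = data["c"][selected]
linked_d = data["d"][selected]
```

Because the selection is stored as a mask over observations rather than over pixels, any number of plots of the same data can respond to one brush.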

Interactive Visualization of Soybean Population Genetic Data

Soybean Project: People and Institutions

Overall Project Goals:

  • Understand historical yield increases
    100% increase in past 100 years; additional 70% increase by 2050 to meet food needs (World Bank)
  • Associate genetic features with phenotypic traits:
    Disease resistance, yield, nutritional content, time to maturity

  • Communicate analysis results intuitively:
    • Target: Soybean farmers, plant geneticists
    • Provide full results (tables) and graphical summaries
    • Interface with existing databases and web resources

Data


  • Sequencing Data (79 varieties, 75GB processed and compressed)

  • Field Trials (168 varieties, 30 varieties with genetic data)

  • New crosses with highest yield varieties
    (sequencing + field trials)

  • Genealogy as reported in the breeding literature (1600 varieties)

Visualizing SNPs:

  • Huge number of interesting genes (70 million identified SNPs)
  • 79 varieties, 20 chromosomes
  • Phenotype and genealogy information
  • Researchers tend to work on gene subsets:
    Must be able to zoom and filter
  • Optimized files for SNP results are still large (10 GB) and require significant computational resources

Above all, need an interface to allow people to pull new discoveries from the data systematically.

Visualizing SNPs

  • SNP: Single Nucleotide Polymorphism, a single basepair mutation
    (A -> T, G -> A, C -> G)
  • Shiny applet: Responsive applet for user-directed data subsets
  • Show multiple levels of detail (less detail = lower computational load)
  • Provide resources in the applet for user exploration (not just a reference tool)
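
The trade-off between detail and computational load noted above can be illustrated by summarizing SNP positions at two window sizes. This is a sketch only: the positions, the chromosome length, and the function name `snp_density` are hypothetical, and it is not the applet's code.

```python
import numpy as np

def snp_density(positions, chrom_length, bin_size):
    """Count SNPs in consecutive windows of bin_size base pairs."""
    edges = np.arange(0, chrom_length + bin_size, bin_size)
    counts, _ = np.histogram(positions, bins=edges)
    return edges[:-1], counts

# Hypothetical SNP positions on a toy chromosome of 5,000 base pairs.
positions = np.array([120, 450, 980, 1020, 1500, 2750, 2900, 4999])
chrom_length = 5000

# Coarse overview: few windows, cheap to compute and draw.
starts_coarse, coarse = snp_density(positions, chrom_length, 2500)
# Finer view: more windows, shown only after the user zooms in.
starts_fine, fine = snp_density(positions, chrom_length, 1000)
```

A chromosome-level view needs only the coarse counts, so individual SNPs have to be loaded only for the region a user selects.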

Applet Design

SNP Population Distribution

SNP Applet Overview

Density of SNPs: Chromosome Level

SNP Density

Individual SNPs: Comparing Varieties

Variety-Level SNP Browser

Genealogy and Phenotypes

SNP Linked Plots

Interactive Plot Design

Good Statistical Graphics

Function:

  • Show the data
  • Don’t distort the data

Form:

  • Show a consistent story
  • Provide several levels of detail (ideally)

Elegance:
How do I best communicate the data?

  • Perceptual Awareness
  • Visual Bandwidth (information overload)

Conclusions

Next Steps

  • User Studies of Interactive Graphics
    • Eye Tracking
    • Click Recording
    • Content Questions
    • “At what point do humans get overloaded?”

  • Color Perception for Statistical Plots
    • ColorBrewer palettes for maps
    • dichromat R package to simulate colorblindness
    • Need for validated color schemes that work well for scatterplots, bar charts, and other statistical plots

  • Hierarchy of Visual Features

Goal: Understand which features are most visually important.

Hierarchy of Graphical Features; No color

Hierarchy of Graphical Features; With color

Other Projects

  • Animint - Extends the ggplot2 implementation of the grammar of graphics to interactive plots

  • USDA Soybean Population Genetics Research
    • Analysis of copy-number variants
    • Genome-wide association studies of identified SNPs
    • Genealogy database

  • Data Aggregation
    • Craigslist ads
    • OkCupid
    • Location-based energy prices

Summary

  • Visualization research is inherently interdisciplinary
  • Statistical graphics makes unique contributions to visualizing variation in data
  • Statistical graphics will evolve to address new big data challenges
  • Need to quantify perception to better evaluate graphs

Acknowledgements

Computation

  • dplyr/plyr
  • reshape2/tidyr
  • cn.mops: CNV identification in populations of genetic data

Visualization Software

  • ggplot2
  • Animint
    d3 interactive web graphics using ggplot2 syntax in R
  • Shiny (RStudio) interactive web applets
  • Reveal.js (slides) with Rmarkdown and knitr

People

  • Heike Hofmann
  • Di Cook
  • Michelle Graham
  • Lindsay Rutter

Other Research

Visual Reasoning

  • Graphics research often uses the lineup protocol, a hypothesis test analogue for static graphics.

  • Goal: Understand correlation between graphical perception, lineup performance, mathematical reasoning, and classification skill.

Statistical Lineups
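
The hypothesis-test analogy behind the lineup protocol can be made concrete with a simple binomial model. This sketch rests on assumptions not stated in the slides: each observer picks one of m panels, observers are independent, and under the null hypothesis every panel is equally likely to be chosen. The function name `lineup_p_value` is hypothetical, and this is not necessarily the analysis used in this work.

```python
from math import comb

def lineup_p_value(n_correct, n_observers, n_panels=20):
    """P(at least n_correct of n_observers pick the data plot by chance)."""
    p = 1.0 / n_panels  # chance of picking the data plot under the null
    return sum(
        comb(n_observers, k) * p ** k * (1.0 - p) ** (n_observers - k)
        for k in range(n_correct, n_observers + 1)
    )

# A single observer who identifies the data plot among 20 panels
# gives a p-value of 1/20 under this model.
p_single = lineup_p_value(1, 1, n_panels=20)
```

More observers choosing the data plot make a chance explanation less plausible, in the same way a more extreme test statistic does in a conventional test.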

Visual Reasoning

Conclusion: Lineups are an inductive classification task using graphics; performance is not seriously impacted by spatial ability (outside of general aptitude).

Figure Classification Task
